linguistic confidence
Can Large Language Models Express Uncertainty Like Human?
Tao, Linwei, Yeh, Yi-Fan, Kai, Bo, Dong, Minjing, Huang, Tao, Lamb, Tom A., Yu, Jialin, Torr, Philip H. S., Xu, Chang
Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we 1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and 2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we 3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we 4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction. The code and dataset are anonymously available at https://anonymous.

Large language models (LLMs) are increasingly deployed in real-world applications, from education and healthcare to law and scientific discovery. While their capabilities make them powerful assistants, LLMs are also prone to hallucinations and factual errors, and human overreliance on their outputs can lead to serious consequences. For instance, a U.S. lawyer once submitted fabricated cases generated by ChatGPT, resulting in professional sanctions (ABC News, 2023). Recent social experiments demonstrate that people adjust their reliance on AI depending on how confident the model appears: reliable expressions of uncertainty can enhance trust, satisfaction, and task accuracy (Kim et al., 2024; Xu et al., 2025). These findings highlight the importance of associating reliable uncertainty estimates with LLM responses to support human decision-making. Ultimately, the conveyance of confidence plays a central role in shaping trust and guiding human-AI interaction. A growing body of work explores the extraction and representation of confidence in LLM outputs, most directly by reading token probabilities from the model's logits. These methods are simple and inexpensive but require access to model logits, which are typically unavailable in commercial LLM APIs. Another line of work prompts the model to verbalize a numerical confidence score alongside its answer. However, such scores rarely align with common user behavior or natural communication, as users do not typically phrase queries with explicit instructions like "Please output your confidence along with the answer."
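To make the hedge-to-confidence idea concrete, below is a minimal sketch of such a mapper. The lexicon, scores, and aggregation rule are illustrative assumptions, not the paper's released dataset or trained mapper.

```python
# Minimal sketch of a hedge-to-confidence mapper (illustrative lexicon and scores).
import re

HEDGE_SCORES = {
    "definitely": 0.95, "certainly": 0.95, "clearly": 0.90,
    "likely": 0.75, "probably": 0.70, "perhaps": 0.50,
    "possibly": 0.45, "might": 0.40, "unlikely": 0.20, "doubtful": 0.15,
}

def linguistic_confidence(response: str, default: float = 0.60) -> float:
    """Map hedging expressions in a model response to a scalar confidence score."""
    tokens = re.findall(r"[a-z']+", response.lower())
    hits = [HEDGE_SCORES[t] for t in tokens if t in HEDGE_SCORES]
    # If several hedges appear, keep the most cautious (lowest) score;
    # with no hedge at all, fall back to a neutral default.
    return min(hits) if hits else default

print(linguistic_confidence("It is probably Paris, but it might be Lyon."))  # 0.4
```

The paper's mapper is presumably built from its human-annotated hedge dataset rather than hand-set as above; the sketch only shows the interface: free-form hedged text in, a scalar confidence out.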
Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
Shrivastava, Vaishnavi, Liang, Percy, Kumar, Ananya
To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach of estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not give access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method, composing linguistic confidences and surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).

As large language models (LLMs) are increasingly deployed, it is important that they signal low confidence on examples where they are likely to make mistakes. This paper's goal is to produce good confidence estimates for state-of-the-art LLMs, which do not provide model probabilities or representations (such as GPT-4 and Claude-v1.3). We first examine a natural idea of eliciting linguistic confidence scores (Tian et al., 2023; Lin et al., 2022; Xiong et al., 2023) -- prompting the LLM to assess its confidence in its answer (Figure 1, GPT-4 Linguistic). We find that linguistic confidences work reasonably well for state-of-the-art models, and much better than a random guessing baseline, but still leave room for improvement (Section 3). Averaged across the datasets, GPT-4 achieves a selective classification AUC of 80.5%, which is 7% above a random guessing baseline. Our results hold across 12 standard datasets (8 MMLU datasets, TruthfulQA, CommonsenseQA, OpenbookQA, and MedQA), 5 models (GPT-4, Claude-v1.3,
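As a concrete illustration of how such confidence estimates are scored and composed, here is a small sketch. The toy data, the accuracy-coverage discretization, and the simple averaging rule are assumptions for illustration, not the paper's exact metric or composition method.

```python
# Sketch: area under the accuracy-coverage curve for a confidence estimator,
# plus a simple composition of linguistic and surrogate-model confidences.
import numpy as np

correct = np.array([1, 0, 1, 1, 0, 1, 1, 0])                            # answer correctness
linguistic = np.array([0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9, 0.7])         # verbalized confidences
surrogate = np.array([0.85, 0.40, 0.75, 0.95, 0.35, 0.70, 0.90, 0.30])  # probs from an open surrogate model

def selective_auc(conf: np.ndarray, correct: np.ndarray) -> float:
    """One common discretization of selective-classification AUC:
    rank answers by confidence, then average accuracy over all coverage levels."""
    order = np.argsort(-conf)
    ranked = correct[order]
    acc_at_coverage = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(acc_at_coverage.mean())

composed = 0.5 * linguistic + 0.5 * surrogate  # assumed composition rule (simple average)
for name, conf in [("linguistic", linguistic), ("surrogate", surrogate), ("composed", composed)]:
    print(f"{name:>10s} AUC: {selective_auc(conf, correct):.3f}")
```

A higher value means the confidence scores do a better job of ranking correct answers above incorrect ones, which is what selective classification rewards.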
Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness
Mielke, Sabrina J., Szlam, Arthur, Boureau, Y-Lan, Dinan, Emily
Open-domain dialogue agents have vastly improved, but still confidently hallucinate knowledge or express doubt when asked straightforward questions. In this work, we analyze whether state-of-the-art chit-chat models can express metacognition capabilities through their responses: does a verbalized expression of doubt (or confidence) match the likelihood that the model's answer is incorrect (or correct)? We find that these models are poorly calibrated in this sense, yet we show that the representations within the models can be used to accurately predict likelihood of correctness. By incorporating these correctness predictions into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.
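The metacognition idea above amounts to: predict how likely the answer is to be correct, then phrase the response with a matching degree of doubt. The sketch below illustrates only that final verbalization step; the thresholds and hedge phrases are illustrative assumptions, not the paper's controllable-generation training setup.

```python
# Sketch: turning a predicted probability of correctness into a hedged response prefix.
# Thresholds and phrases are illustrative assumptions.
def hedge_prefix(p_correct: float) -> str:
    """Choose a verbal hedge whose strength matches the predicted correctness."""
    if p_correct >= 0.9:
        return "I'm confident that"
    if p_correct >= 0.6:
        return "I believe"
    if p_correct >= 0.3:
        return "I'm not sure, but possibly"
    return "I don't really know, but my guess is"

print(hedge_prefix(0.92), "the capital of France is Paris.")
print(hedge_prefix(0.25), "the capital of Australia is Sydney.")
```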